DIMENSION REDUCTION: PRINCIPAL COMPONENTS AND PARTIAL LEAST SQUARES

 

1. Principal Components Regression

 

> library(ISLR); library(pls); attach(Auto)     # Auto data come from the ISLR package

> pcr.fit = pcr( mpg ~ . - name - origin + as.factor(origin),  data=Auto )     # All predictors except name; origin recoded as a factor

> summary(pcr.fit)

Number of components considered: 8

TRAINING: % variance explained

     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps

X      99.76    99.96   100.00   100.00   100.00   100.00   100.00   100.00

mpg    69.35    70.09    70.75    80.79    80.88    80.91    80.93    82.42

 

# The “X” row shows % of X variation contained in the given number of PCs.

# The “mpg” row shows R² (% of Y variation explained) from the PC regression. Ordinary linear regression on all 8 predictors has the same R² as PCR using all 8 principal components.

 

> reg = lm( mpg ~ .-name-origin+as.factor(origin), data=Auto )

> summary(reg)

Multiple R-squared:  0.8242
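
# Sanity check (a sketch, using the pcr.fit and reg objects above): the fitted
# values from PCR with all M=8 components should match the OLS fit up to
# numerical rounding.

> max( abs( pcr.fit$fitted.values[ , , 8] - fitted(reg) ) )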

 

1a. Principal components

 

# Let’s investigate the principal components, and how much variance they explain.

 

> X = model.matrix( mpg ~ .-name-origin+as.factor(origin), data=Auto )

> pc = princomp(X)

> summary(pc)

Importance of components:

                            Comp.1       Comp.2       Comp.3       Comp.4       Comp.5

Standard deviation     854.5664182 38.860050688 1.614144e+01 3.309297e+00 1.694518e+00

Proportion of Variance   0.9975617  0.002062789 3.559039e-04 1.495959e-05 3.922297e-06

Cumulative Proportion    0.9975617  0.999624521 9.999804e-01 9.999954e-01 9.999993e-01

                             Comp.6       Comp.7       Comp.8       Comp.9

Standard deviation     5.242357e-01 4.162175e-01 2.443204e-01 1.110223e-16

Proportion of Variance 3.754062e-07 2.366403e-07 8.153944e-08 1.683715e-38

Cumulative Proportion  9.999997e-01 9.999999e-01 1.000000e+00 1.000000e+00

 

# So Z1, the first PC, contains 99.76% of the total variation of the X variables. The first two PCs together contain 99.96%. Here is a plot of these percentages, called a scree plot.

 

> screeplot(pc)
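
# The same percentages can be computed by hand from the PC standard deviations
# returned by princomp(): each PC’s variance over the total variance.

> pc$sdev^2 / sum(pc$sdev^2)             # proportion of variance per PC

> cumsum(pc$sdev^2) / sum(pc$sdev^2)     # cumulative proportion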

 

# The actual loadings (the coefficients that define each PC) can be obtained with prcomp().

 

> prcomp(X)

                             PC1           PC2          PC3          PC4           PC5

(Intercept)         0.0000000000  0.0000000000  0.000000000  0.000000000  0.000000e+00

cylinders          -0.0017926225  0.0133245279 -0.007294275  0.001414710  1.719368e-02

displacement       -0.1143412856  0.9457785881 -0.303312504 -0.009143349 -1.059355e-02

horsepower         -0.0389670412  0.2982553337  0.948761071 -0.043076559 -8.646402e-02

weight             -0.9926735354 -0.1207516411 -0.002454212  0.001480458  3.152970e-03

acceleration        0.0013528348 -0.0348264293 -0.077006895  0.059516278 -9.944974e-01

year                0.0013368415 -0.0238516081 -0.042819254 -0.996935229 -5.549653e-02

as.factor(origin)2  0.0001308250 -0.0024889942  0.002857670  0.022100094 -9.052576e-05

as.factor(origin)3  0.0002103564 -0.0003765828  0.004796684 -0.012089823 -1.150938e-03

                             PC6           PC7           PC8 PC9

(Intercept)         0.0000000000  0.0000000000  0.000000e+00   1

cylinders           0.9911554803  0.1211162208 -4.909265e-02   0

displacement       -0.0146594359 -0.0006512752  4.394368e-03   0

horsepower          0.0038232742  0.0034425206 -4.435100e-03   0

weight             -0.0002093216 -0.0003053766  5.729471e-06   0

acceleration        0.0168319859  0.0012233398 -1.799780e-03   0

year               -0.0001647840  0.0240346554  7.643176e-03   0

as.factor(origin)2 -0.0483462982  0.6888706846  7.229226e-01   0

as.factor(origin)3  0.1214929883 -0.7142804151  6.891098e-01   0
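
# The loadings above define each PC as a linear combination of the centered
# X columns. A sketch verifying that these combinations reproduce the scores
# prcomp() returns in its $x component:

> pr = prcomp(X)

> Z = scale(X, center=TRUE, scale=FALSE) %*% pr$rotation

> max( abs( Z - pr$x ) )                 # should be ~0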

 

 

 

1b. Standardized scale

So we see that the 1st principal component contains a huge portion of the total variation of the X variables, and it is dominated by the variable “weight”. Of course! Looking at the data, weight simply has the largest values (confirmed by the check after the data preview below).

 

> head(Auto)

  mpg cylinders displacement horsepower weight acceleration year origin

1  18         8          307        130   3504         12.0   70      1

2  15         8          350        165   3693         11.5   70      1

3  18         8          318        150   3436         11.0   70      1

4  16         8          304        150   3433         12.0   70      1

5  17         8          302        140   3449         10.5   70      1

6  15         8          429        198   4341         10.0   70      1
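
# A quick check (a sketch; assumes columns 2-7 of Auto are the numeric
# predictors): the standard deviation of weight dwarfs the others.

> sapply( Auto[ , 2:7], sd )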

 

# For this reason, the X variables are usually standardized first (subtract each X variable’s mean, divide by its standard deviation).

 

> pcr.fit = pcr( mpg ~ .-name-origin+as.factor(origin), data=Auto, scale=TRUE )

> summary(pcr.fit)

TRAINING: % variance explained

     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps

X       56.9    73.02    84.29    92.38    97.29    98.86    99.59   100.00

mpg     71.8    73.64    73.96    79.25    79.25    80.22    81.55    82.42
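
# Equivalently (a sketch), the standardized PCA can be run directly with
# prcomp(); the constant intercept column of X must be dropped first, since
# a zero-variance column cannot be rescaled to unit variance.

> summary( prcomp( X[ , -1], scale.=TRUE ) )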

 

 

1c. Cross-validation

 

# Cross-validation. By default, this is a K-fold cross-validation with K=10.

 

> pcr.fit = pcr( mpg ~ .-name-origin+as.factor(origin), data=Auto, scale=TRUE, validation="CV" )

> summary(pcr.fit)

VALIDATION: RMSEP

Cross-validated using 10 random segments.

       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps

CV           7.815    4.162    4.036    4.028    3.611    3.616    3.537    3.427    3.350

adjCV        7.815    4.161    4.034    4.026    3.607    3.613    3.533    3.422    3.346

 

# The prediction error (estimated by cross-validation) is minimized by using all M=8 principal components.

# We can plot the root mean squared error of prediction against the number of components (or specify val.type="MSEP" or val.type="R2").

 

> validationplot(pcr.fit)                      
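
# The CV numbers can also be extracted programmatically (a sketch, using the
# pls package’s RMSEP() accessor). The first entry is the intercept-only
# model, so subtract 1 to get the best M.

> err = RMSEP( pcr.fit, estimate="CV" )

> which.min( err$val["CV", "mpg", ] ) - 1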

 

2. Partial Least Squares

# Similar commands; just replace “pcr” with “plsr”. M=6 components gives the lowest cross-validated RMSEP (CV = 3.351).

 

> pls = plsr( mpg ~ .-name-origin+as.factor(origin), data=Auto, scale=TRUE, validation="CV" )

> summary(pls)

VALIDATION: RMSEP

Cross-validated using 10 random segments.

       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps

CV           7.815    3.994    3.616    3.540    3.395    3.379    3.351    3.364    3.362

adjCV        7.815    3.992    3.612    3.535    3.390    3.376    3.345    3.359    3.357

 

TRAINING: % variance explained

     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps

X      56.73    68.84    80.75    84.08    93.48    94.88    99.33   100.00

mpg    74.32    79.37    80.29    81.71    82.00    82.35    82.38    82.42

 

> validationplot(pls)
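
# Having settled on M=6 (a sketch): coef() returns the implied regression
# coefficients for the 6-component fit, and predict() gives its predictions.

> coef( pls, ncomp=6 )

> pls.pred = predict( pls, Auto, ncomp=6 )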